Assignment 3

The objective is

To implement Cocke-Kasami-Younger (CKY) algorithm for bottom-up CFG parsing, and apply it to the word and the parsing problem of English.

Ingredients that we have to represent or manipulate using NLTK, are:

Grammar
Test sentences
CKY algorithm

Few notes about Assignment statement

To run examples from the statement of the assignment you need atis grammar, if atis grammar is not available, execute nltk.download() and select large_grammars in tab models and press "install".

The following line of the statement is wrong!



In [ ]:

    
# parse all test sentences
for sentence in s:
    parser.chart_parse(sentence[0]

That's because s is a long string with all the sentences joined. You have to use t that is a list of tuples, where each element of the list corresponds to a sentence, a tuple whose first element is a list of words in the sentence.



In [ ]:

    
# parse all test sentences
for sentence in t:
    try:
        parser.chart_parse(sentence[0])
    except ValueError:
        pass

I explain this because is quite frustrating to try the most basic commands and that they don't work.

The try-except statement it is necessary because some sentences had tokens that are not available in the grammar, this rise an exception that stops program execution. This statement prevents program execution stops.

This is a mistake in the assignment wording.

To make easier the execution, move atis-grammar-cnf.cfg and atis-test-sentences.txt to the same folder as the script. They are modified versions of files that can be downloaded using ntlk.download() and that are located on nltk_data folder of your home.

Files of atis large-grammar are useful because atis_sentences.txt files has the number of parse trees corresponding to each sentence, this information is handy to check our algorithms.

Loading grammar and sentences

Well, let's start! Begin loading the grammar:



In [1]:

    
import nltk
g = nltk.data.load("atis-grammar-cnf.cfg")



In [2]:

    
print type(g)
g









    



<class 'nltk.grammar.CFG'>






    Out[2]:





<Grammar with 20326 productions>

As the parameter of nltk.data.load() has the .cfg extension, NLTK read it as a grammar.

Then load the sentences:



In [3]:

    
s = nltk.data.load("atis_test_sentences_a3.txt", "raw")
s[:100]









    Out[3]:





'what is the cheapest one way flight from columbus to indianapolis .\nis there a flight from memphis t'



In [4]:

    
t = nltk.parse.util.extract_test_sentences(s)
t[:3]









    Out[4]:





[(['what',
   'is',
   'the',
   'cheapest',
   'one',
   'way',
   'flight',
   'from',
   'columbus',
   'to',
   'indianapolis',
   '.'],
  None),
 (['is',
   'there',
   'a',
   'flight',
   'from',
   'memphis',
   'to',
   'los',
   'angeles',
   '.'],
  None),
 (['please',
   'show',
   'me',
   'the',
   'flights',
   'from',
   'chicago',
   'to',
   'detroit',
   'that',
   'arrive',
   'at',
   'six',
   'p.m.',
   'next',
   'tuesday',
   '.'],
  None)]

As you can see, the function nltk.parse.util.extract_test_sentences() splits s by \n as separator of sentences, and then split each sentence using whitespace as separator of words.

None values indicate that there are not information on whether the sentence is grammatical or not, or the number of parse trees that should have.

Production table

First, we create a production_table that is a dict of lists such that the keys are productions right-hand-sides of the given grammar, and the values the corresponding production.

This a little trick that would be necessary for CKY algoritms implementations.



In [5]:

    
from assignment3 import init_production_table
init_production_table??



In [6]:

    
pt = init_production_table(g)
print pt[('VERB_VB', 'BUM')]









    



[IMPR_VB -> VERB_VB BUM, SIGMA -> VERB_VB BUM]

CKY recognizer

In Jurafsky and Martin (2009) we can get the following pseudocode for the CKY algorithm:



In [ ]:

    
function CKY-PARSE (words, grammar) returns table
    for j ← from 1 to LENGTH(words) do
        table[j − 1, j] ← {A | A → words[j] ∈ grammar }
        for i ← from j − 2 downto 0 do
            for k ← i + 1 to j − 1 do
                table[i,j] ← table[i, j] ∪
                    {A | A → BC ∈ grammar, B ∈ table[i, k], C ∈ table[k, j]}

Then the implementation is almost straightforward in Python using NLTK objects.



In [7]:

    
from assignment3 import cky_recognizer
cky_recognizer??

In cky_recognizer() we start to define the table:



In [ ]:

    
n_tokens = len(tokens)
    table = [[None for j in range(n_tokens + 1)] for i in range(n_tokens)]

"words" in the pseudocode corresponds to tokens, and "grammar" corresponds to cnf_grammar.

The loop

for j ← from 1 to LENGTH(words) do

is implemented as:



In [ ]:

    
for j in range(1, n_tokens + 1):

LENGTH(words) corresponds to n_tokens.

Remember that range(i, j) goes from i to j-1, therefore is added one to n_tokens.

The expression:

    table[j − 1, j] ← {A | A → words[j] ∈ grammar }

Means that left-hand-sides of the productions with word[j] in right-hand-side that belong to the grammar are assigned at the position [j - 1, j] of the table (the diagonal).



In [ ]:

    
productions = cnf_grammar.productions(rhs=tokens[j-1]) # productions with tokens[j-1] (word[j]) in rhs
        lhs_list = [] # list of left-hand-side
        for production in productions:
            lhs_list.append(production.lhs())
        table[j - 1][j] = lhs_list

The loop

    for i ← from j − 2 downto 0 do

Corresponds to:



In [ ]:

    
for i in range(j - 2, -1, -1):

And

        for k ← i + 1 to j − 1 do
            table[i,j] ← table[i, j] ∪
                {A | A → BC ∈ grammar,
                     B ∈ table[i, k],
                     C ∈ table[k, j]}

Means that left-hand-sides of the productions with B and C in right-hand-side that belong to the grammar, are assigned at the position [i, j] of the table (upper right corner). Where, B and C are elements to the "right" and "below" of position [i, j] respectively.

"the contens of two cells can be combined in a way that is sanctioned by a rule in the grammar. If such a rule exists, the non-terminal on its left-hand side is entered into the table." (Jurafsky & Martin, 2009)



In [ ]:

    
lhs_list = []
            # Range over the places the string can be split
            for k in range(i + 1, j):
                right = table[i][k] # Right along row i
                down = table[k][j] # Down along column j
                for r_lhs in right:
                    for d_lhs in down:
                        key = (str(r_lhs), str(d_lhs))
                        if production_table.has_key(key):
                            for production in production_table[key]:
                                lhs_list.append(production.lhs())

Here you can see the production_table trick. It's an easy way to obtain the productions that corresponds to a given pair of nonterminals.

Finally, the function returns the table.



In [ ]:

    
return table

Is in CFG language

To determine whether the sentence is in the CKY language we can use is_in_CFG_language() function.

This function basically verifies that at least one nonterminal in the upper right corner of the table corresponds to the grammar start (in this case 'SIGMA').



In [8]:

    
from assignment3 import is_in_CFG_language
is_in_CFG_language??

Let's try 40 sentences from atis-test-sentences.txt, that are selected in the file "atis_test_sentences_a3.txt". Sentences with more than 50 trees were avoided. There are 3 cases:

Sentences that are in the language
Sentences not recognized by language
Sentences whose tokens are not covered by grammar



In [9]:

    
for i, sentence in enumerate(t):
    table = cky_recognizer(sentence[0], g, pt)
    if table != []:
        if is_in_CFG_language(table, g):
            print 'sentence {} is in language'.format(i + 1)
        else:
            print 'sentence {} not recognized'.format(i + 1)
    else:
        print 'sentence {} has tokens not covered by grammar'.format(i + 1)









    



sentence 1 is in language
sentence 2 is in language
sentence 3 is in language
sentence 4 not recognized
sentence 5 not recognized
sentence 6 is in language
sentence 7 not recognized
sentence 8 is in language
sentence 9 is in language
sentence 10 is in language
sentence 11 is in language
sentence 12 is in language
sentence 13 is in language
sentence 14 not recognized
sentence 15 is in language
sentence 16 has tokens not covered by grammar
sentence 17 is in language
sentence 18 not recognized
sentence 19 is in language
sentence 20 is in language
sentence 21 has tokens not covered by grammar
sentence 22 not recognized
sentence 23 not recognized
sentence 24 is in language
sentence 25 is in language
sentence 26 is in language
sentence 27 is in language
sentence 28 is in language
sentence 29 is in language
sentence 30 is in language
sentence 31 is in language
sentence 32 is in language
sentence 33 is in language
sentence 34 is in language
sentence 35 has tokens not covered by grammar
sentence 36 not recognized
sentence 37 not recognized
sentence 38 is in language
sentence 39 is in language
sentence 40 is in language

Examples:

Sentence 4: "does united flight four seven four slash fourteen eighty four serve dinner ." is not recognized, because none of the nonterminals of the upper level is the start of grammar.
Sentence 16: "list these city destinations ." has the token "destinations" that is not covered by the grammar.

CKY parser

The main diference between the recognizer and the parser is that where before we had left sides of productions now have trees (of type nltk.Tree).

Where we had in recognizer:



In [ ]:

    
productions = cnf_grammar.productions(rhs=tokens[j-1])
        lhs_list = []
        for production in productions:
            lhs_list.append(production.lhs())
        table[j - 1][j] = lhs_list

In parser we have:



In [ ]:

    
productions = cnf_grammar.productions(rhs=tokens[j-1])
        trees = []
        for production in productions:
            trees.append(nltk.Tree(production.lhs(), [production.rhs()]))
        table[j-1][j] = trees

Note that the tree is an structure such that the parent node corresponds to the production left-hand-side, and the children nodes corresponds to the production right-hand-side.

In a similar way, where in recognizer we had:



In [ ]:

    
lhs_list = []
            # Range over the places the string can be split
            for k in range(i + 1, j):
                right = table[i][k] # Right along row i
                down = table[k][j] # Down along column j
                for r_lhs in right:
                    for d_lhs in down:
                        key = (str(r_lhs), str(d_lhs))
                        if production_table.has_key(key):
                            for production in production_table[key]:
                                lhs_list.append(production.lhs())
            table[i][j] = lhs_list

Now, in parser we have:



In [ ]:

    
trees = []
            # Range over the places the string can be split
            for k in range(i + 1, j):
                right = table[i][k] # Right along row i
                down = table[k][j] # Down along column j
                for r_tree in right:
                    for d_tree in down:
                        key = (str(r_tree.label()), str(d_tree.label()))
                        if production_table.has_key(key):
                            for production in production_table[key]:
                                trees.append(nltk.Tree(production.lhs(),
                                                       [r_tree, d_tree]))
            table[i][j] = trees

In this last snippet of code, note that the magic is that r_tree and d_tree are inserted as children nodes, so that at the top level we have complete trees for the whole sentence.



In [ ]:

    
trees_found = []
    for tree in table[0][n_tokens]:
        if str(tree.label()) == str(cnf_grammar.start()):
            trees_found.append(tree)

    return trees_found

In this case, we return a list of trees, those which top node coincides with the grammar start ('SIGMA').



In [10]:

    
from assignment3 import cky_parser
cky_parser??



In [ ]:

    
f = open("atis_n_trees.txt", "w")
for sentence in t:
    trees = cky_parser(sentence[0], g, pt)
    f.write("{}\t{}\n".format(len(trees),' '.join(sentence[0])))
f.close()



In [ ]:

    
trees = cky_parser(t[5][0], g, pt)
for tree in trees:
    tree.draw()



In [ ]: